Abstract

Arc-annotated sequences are useful for representing
structural information of RNAs and have been
extensively used for comparing RNA structures in
terms of both sequence and structural similarities.
Among the many paradigms referring to arc-annotated
sequences and RNA structure comparison (see
(Blin et al. 2008) for more details), the most
important one is the general edit distance. The
problem of computing an edit distance between two
non-crossing arc-annotated sequences was introduced in
(Evans 1999). The introduced model uses edit
operations that involve either single letters or pairs of
letters (never considered separately) and is solvable
in polynomial time (Zhang & Shasha 1989).

To account for other possible RNA structural
evolutionary events, new edit operations, which consider
the letters of a pair either simultaneously or separately,
were introduced in (Jiang et al. 2002), unfortunately
at the cost of computational tractability.
It has been proved that comparing two RNA
secondary structures using a full set of biologically
relevant edit operations is NP-complete. Nevertheless,
in (Guignon et al. 2005), the authors used
a strong combinatorial restriction to compare
two RNA stem-loops with a full set of biologically
relevant edit operations, which allowed them to
design a polynomial-time and polynomial-space algorithm
for comparing general secondary RNA structures.

In this paper we will prove theoretically that com- 
paring two RNA structures using a full set of biologi- 
cally relevant edit operations cannot be done without 
strong combinatorial restrictions. 

Keywords: RNA structures, Longest Arc-Preserving
Common Subsequence (LAPCS), NP-hardness, Stem-loops

1 Introduction 

In computational biology, comparison of RNA 
molecules has recently attracted a lot of interest due 
to the rapidly increasing amount of known RNA 
molecules, especially non-coding RNAs. Very often,
arc-annotated sequences, originally introduced in
(Evans 1999), are used to represent RNA structures.
An arc-annotated sequence is a sequence over a given 
alphabet together with additional structural informa- 
tion specified by arcs connecting pairs of positions. 
The arcs determine the way the sequence folds into a 
three-dimensional space. 

The problem of computing an edit distance between
two arc-annotated sequences was introduced
in (Evans 1999) with a model that used only three
edit operations (deletion, insertion and substitution)
either on single letters (letters in the sequence with
no incident arc) or pairs of letters (letters connected
by an arc). In this model, the two letters of an
arc are never considered separately, and hence the
problem of computing the edit distance between two
arc-annotated sequences becomes equivalent (when
no pair of arcs is crossing) to the tree edit distance
problem, which can be solved in polynomial time
(Zhang & Shasha 1989).

To account for other possible RNA structural
evolutionary events, new edit operations, such as
creation, deletion or modification of arcs between pairs
of letters, were introduced in (Jiang et al. 2002) at
the cost of computational tractability. Indeed, it
has been shown in (Blin, Fertin, Rusu & Sinoquet
2007) that in the case of non-crossing arcs, the
problem of computing the edit distance between two
arc-annotated sequences under this model is NP-hard.
By applying constraints either on the legal edit
operations or on the allowed alignments, several
papers have shed new light on the borderline between
tractability and intractability
(Guignon et al. 2005, Blin et al. 2008). Of particular
importance, in (Guignon et al. 2005), the authors
introduced the notion of conservative edit distance and
mapping between two RNA stem-loops in order to
design a polynomial-time algorithm for comparing
general secondary RNA structures using the full set of
biological edit operations introduced in (Jiang et al.
2002). This algorithm is based on a decomposition into
stem-loop-like substructures that are compared pairwise
and used to compare complete RNA secondary
structures. As mentioned in (Guignon et al. 2005),
whereas in the very restrictive case of conservative
distance and mapping, the computation of the
general edit distance is polynomial-time solvable, it is not
known if the general, i.e., not conservative, edit
distance between two stem-loops can also be computed
in polynomial time.

In this paper, we will show that this strong combi- 
natorial restriction was necessary for the problem to 
become polynomial since it is NP-hard in the general 
case. Despite the fact that this result may be consid- 
ered as purely theoretical, it proves that comparing 
two RNA structures using a full set of biologically rel- 
evant edit operations cannot be done without strong 
combinatorial restrictions. 



2 Preliminaries 

Given a finite alphabet $\Sigma$, an arc-annotated sequence
is formally defined by a pair $(S, P)$, where $S$ is a string
of $\Sigma^*$ and $P$ is a set of arcs connecting pairs of letters
of $S$. In reference to RNA structures, letters are called
bases. Bases with no incident arc are called single
bases. In an arc-annotated sequence, two arcs $(i_1, j_1)$
and $(i_2, j_2)$ are crossing if $i_1 < i_2 < j_1 < j_2$ or
$i_2 < i_1 < j_2 < j_1$. An arc $(i_1, j_1)$ is embedded into another
arc $(i_2, j_2)$ if $i_2 < i_1 < j_1 < j_2$. Evans (Evans 1999)
(see (Guignon et al. 2005) for extensions) introduced
five different levels of arc structure: Unlimited - no
restriction at all; Crossing - there is no base incident
to more than one arc; Nested - there is no base
incident to more than one arc and no two arcs are
crossing; Stem - there is no base incident to more than
one arc and given any two arcs, one is embedded into
the other; Plain - there is no arc. There is an obvious
inclusion relation between those levels: Plain $\subset$
Stem $\subset$ Nested $\subset$ Crossing $\subset$ Unlimited. An
arc-annotated sequence $(S_1, P_1)$ is said to occur in
another arc-annotated sequence $(S_2, P_2)$ if one can
obtain the former from the latter by repeatedly
deleting bases (deleting a base that is incident to an arc
results in the deletion of the arc).
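
The five levels can be illustrated with a short Python sketch that returns the most restrictive level satisfied by a set of arcs. It is only an illustration under our own conventions: arcs are assumed to be pairs of 0-based positions, and the function name `arc_level` is hypothetical (it does not come from the cited works).

```python
def arc_level(arcs):
    """Return the most restrictive level (Plain, Stem, Nested, Crossing,
    Unlimited) satisfied by a collection of arcs given as position pairs."""
    # Normalise every arc as (left, right) and sort by left endpoint.
    arcs = sorted((min(a), max(a)) for a in arcs)
    if not arcs:
        return "Plain"
    # Unlimited: some base is incident to more than one arc.
    endpoints = [p for arc in arcs for p in arc]
    if len(endpoints) != len(set(endpoints)):
        return "Unlimited"
    # Crossing: two arcs (i1, j1), (i2, j2) with i1 < i2 < j1 < j2.
    if any(i1 < i2 < j1 < j2 for (i1, j1) in arcs for (i2, j2) in arcs):
        return "Crossing"
    # Stem: once sorted by left endpoint, each arc must be embedded
    # into the previous one; embedding is transitive.
    is_stem = all(i1 < i2 and j2 < j1
                  for (i1, j1), (i2, j2) in zip(arcs, arcs[1:]))
    return "Stem" if is_stem else "Nested"
```

For instance, two arcs placed side by side are Nested but not Stem, whereas two arcs where one is embedded into the other form a Stem.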

Among the many paradigms referring to arc-annotated
sequences (see (Blin et al. 2008) for more
details), we focus in this article on the Longest
Arc-Preserving Common Subsequence (Lapcs
for short) (Evans 1999, Jiang et al. 2004, Lin et al.
2002) and the general edit distance (Edit for short)
(Jiang et al. 2002, Blin, Fertin, Herry & Vialette
2007). Indeed, as shown in (Blin et al. 2008), those
two paradigms are closely related since the Lapcs
problem is a special case of Edit when considering
the complete set of edit operations defined in
(Jiang et al. 2002). Therefore, the hardness results
for Lapcs also hold for Edit.

Formally, the Longest Arc-Preserving Common
Subsequence problem is defined as follows:
given two arc-annotated sequences $(S_1, P_1)$
and $(S_2, P_2)$, find the longest - in terms of
sequence length - common arc-annotated subsequence
that occurs in both $(S_1, P_1)$ and $(S_2, P_2)$. It has
been shown in (Jiang et al. 2002) that the Lapcs
problem is NP-hard even for Nested structures,
i.e., Lapcs(Nested, Nested). Still focusing on
Nested structures, Alber et al. (Alber et al. 2004)
proved that the Lapcs(Nested, Nested) problem
is solvable in $O(3^k |\Sigma|^k kn)$ time, where $n$ is the
maximum length of the two sequences and $k$ is the
length of the common subsequence searched for. The
$O(3^k |\Sigma|^k kn)$ time parameterized algorithm by Alber
et al. is by brute-force enumeration: (i) generate
all possible sequences of length $k$ with all possible
Nested arc annotations, and (ii) for each of these
arc-annotated candidate sequences, check whether or
not it occurs as a pattern in both $S_1$ and $S_2$. At
the heart of this approach is the fact that it can
be decided in $O(nk)$ time whether or not this
sequence occurs as an arc-preserving common
subsequence (Gramm et al. 2006). It is easily seen that
the above algorithm reduces to $O(|\Sigma|^k 2^{k-1} kn)$ time for
Lapcs(Stem, Stem). Indeed, there exist $|\Sigma|^k$
sequences of length $k$ and, for a given sequence
of length $k$, there exist $\binom{k}{2i}$ different arc annotations
with $i$ arcs. Therefore, there exist
$\sum_{i=0}^{\lfloor k/2 \rfloor} \binom{k}{2i} = 2^{k-1}$
arc annotations of a given sequence of length $k$.
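
The counting argument for Stem annotations can be checked on small values of $k$ with the following Python sketch. It relies on the observation that a Stem annotation with $i$ arcs is determined by its set of $2i$ endpoints: the $i$ leftmost are opening positions and the $i$ rightmost are closing positions, matched from the outside in. The generator name `stem_annotations` is ours and purely illustrative.

```python
from itertools import combinations


def stem_annotations(k):
    """Yield every Stem arc annotation of a sequence of length k,
    each as a tuple of (left, right) arcs over 0-based positions."""
    for i in range(k // 2 + 1):
        # Choose the 2i endpoints; combinations() yields them sorted.
        for chosen in combinations(range(k), 2 * i):
            lefts, rights = chosen[:i], chosen[i:]
            # Outermost left endpoint pairs with outermost right one.
            yield tuple(zip(lefts, reversed(rights)))


# Compare the enumeration with the closed form 2^(k-1) for small k >= 1.
for k in range(1, 11):
    assert sum(1 for _ in stem_annotations(k)) == 2 ** (k - 1)
```

The enumeration is exponential in $k$, so it is only meant to make the $2^{k-1}$ bound concrete, not to be used on real instances.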

Here, we focus on the only remaining open problems
concerning Lapcs and Edit over stem-loops by
showing, with a single proof, their hardness. More
precisely, we prove that Lapcs(Stem, Stem) - which
may be considered as a very restricted problem and
thus not interesting - is NP-hard in order to infer
the NP-hardness of Edit(Stem, Stem) - which is
certainly, according to (Guignon et al. 2005), an
interesting problem that can be used in a very simple way to
compare complete RNA secondary structures. This
result also proves that any future work on comparing
RNA structures with a full set of edit operations
will have to introduce strong combinatorial
restrictions in order to get an exact polynomial-time
algorithm since, even with the simplest model, the
general edit distance problem is still NP-complete.

3 Comparing RNA Stem-Loops is NP- 
complete 

In this section, we prove that Lapcs over stem-loops
(Lapcs(Stem, Stem)) is NP-complete (Theorem 1),
thereby answering an open question of
(Guignon et al. 2005). This last result induces the
NP-hardness of Edit over stem-loops.

Theorem 1. Lapcs(Stem, Stem) is NP-complete. 

Corollary 1. Comparing RNA structures with a full 
set of biologically relevant edit operations cannot be 
done without introducing strong combinatorial restric- 
tions. 

In the following, we consider the decision version of
the problem, which corresponds to deciding if there
exists an arc-preserving common subsequence of length
greater than or equal to a given parameter $k'$.
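
To make the decision version concrete, here is a naive Python sketch that tests, by exhaustive enumeration, whether two small arc-annotated sequences share an arc-preserving common subsequence of a given length. The representation (a string plus a set of 0-based position pairs) and the names `induced` and `has_apcs` are assumptions made for illustration only; the running time is exponential, in line with the hardness result discussed in this section. Testing length exactly $k$ is enough, since deleting matching bases from a longer common subsequence yields a shorter one.

```python
from itertools import combinations


def induced(seq, arcs, keep):
    """Arc-annotated sequence obtained by deleting every position not in
    keep; an arc survives only if both of its endpoints are kept."""
    rank = {pos: r for r, pos in enumerate(keep)}
    letters = "".join(seq[pos] for pos in keep)
    kept_arcs = frozenset((rank[i], rank[j]) for i, j in arcs
                          if i in rank and j in rank)
    return letters, kept_arcs


def has_apcs(s1, p1, s2, p2, k):
    """True if (s1, p1) and (s2, p2) have an arc-preserving common
    subsequence of length k (brute force, tiny inputs only)."""
    candidates = {induced(s1, p1, keep)
                  for keep in combinations(range(len(s1)), k)}
    return any(induced(s2, p2, keep) in candidates
               for keep in combinations(range(len(s2)), k))
```

For example, `has_apcs("AU", {(0, 1)}, "AU", set(), 2)` is false because the arc of the first sequence has no counterpart in the second, whereas a common subsequence of length 1 exists.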

It is easy to see that the Lapcs problem is in
NP. In order to prove its NP-hardness, we
define a reduction from the NP-complete 3SAT problem
(Garey & Johnson 1979), which is defined as follows:
given a collection $C_q = \{c_1, c_2, \ldots, c_q\}$ of $q$ clauses,
where each clause consists of a set of 3 literals
(representing the disjunction of those literals) over a finite
set of $n$ boolean variables $V_n = \{x_1, x_2, \ldots, x_n\}$, is
there an assignment of truth values to each variable
of $V_n$ s.t. at least one of the literals in each clause is
true?
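
As a reminder of what a 3SAT instance looks like, the following Python sketch decides satisfiability by trying all $2^n$ assignments. The encoding is an assumption of ours: a literal is a non-zero integer, `k` standing for $x_k$ and `-k` for its negation. The sample clauses are a hypothetical instance chosen for illustration and are not claimed to be the one used in the figures.

```python
from itertools import product


def satisfying_assignment(clauses, n):
    """Return a tuple of n booleans satisfying every clause, or None.
    Each clause is a tuple of 3 signed integers (k means x_k is true,
    -k means x_k is false)."""
    for values in product([False, True], repeat=n):
        if all(any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return values
    return None


# Hypothetical instance with n = 4 variables and q = 3 clauses.
example = [(1, 2, 3), (-1, 2, 4), (2, -3, -4)]
assignment = satisfying_assignment(example, 4)
```

The exhaustive search is exponential in $n$; it is shown only to fix the encoding of clauses and literals.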

Let $(C_q, V_n)$ be any instance of the 3SAT problem
s.t. $C_q = \{c_1, c_2, \ldots, c_q\}$ and $V_n = \{x_1, x_2, \ldots, x_n\}$.
For convenience, let $L_i^j$ denote the $j$-th literal of the
$i$-th clause (i.e. $c_i$) of $C_q$. In the following, given a
sequence $S$ over an alphabet $\Sigma$, let $\chi(i, c, S)$ denote
the $i$-th occurrence of the letter $c$ in $S$.

We build two arc-annotated sequences $(S_1, P_1)$ and
$(S_2, P_2)$ as follows. An illustration of a full example
is given in Figures 1 and 2, where $n = 4$ and $q = 3$.
For readability, the arc-annotated sequences
resulting from the construction have been split into
several parts and a schematic overview of the overall
placement of each part is provided.

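Let $S_1 = C_q^1 W_q C_{q-1}^1 \ldots C_2^1 W_2 C_1^1 W_1 S_M^1 V_1 P_1^1 V_2
P_2^1 \ldots P_{q-1}^1 V_q P_q^1$ and $S_2 = C_q^2 W_q C_{q-1}^2 \ldots C_2^2 W_2 C_1^2 W_1
S_M^2 V_1 P_1^2 V_2 P_2^2 \ldots P_{q-1}^2 V_q P_q^2$ such that for all $1 \le i \le
q$, $1 \le k \le n$,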

• C} = RfQ^RjQ^XlXl . . . XlQ,RlQ,Rl with 

xl = XkSjXk if Xk = Lj or Xk = Lj; xl = XkXk 
otherwise; 

• Pi = Qq+iQq+iPq+i^n ■ ■ ■ Xl^j^Rq^iXl . . . Xl 
R\+iQq+iQq+t such that Xl = X^Xk] 

_ /^2 _ v2 V2 r3/0 y2 y2 t>2 y2 y2 

• Gj - . . . A„Kj(^jAi . . . AnKj A„_^^ . . . A^ 

QiRlXl...Xl such that for 1 < j < 3, 
x{j,Xl,Cf) = Xfc^feSj (resp. SjXkX^) \i Xk = Ll 
(resp. = L^i); x{j, ^l, Cf) = XkX^ otherwise; 

• Pf = ■ ■ ■ ^iR\+iQq+i^n ■ ■ ■ A^l+i^^+i 
Xl... XlQq+,Rl^^Xl ...Xl with Xl = x^xfe. 



Moreover, let $S_M^1 = x_1 \bar{x}_1 x_2 \bar{x}_2 \ldots x_n \bar{x}_n$ and $S_M^2 =
\bar{x}_1 x_1 \bar{x}_2 x_2 \ldots \bar{x}_n x_n$. Notice that, by construction,
there is only one occurrence of each of $\{s_1, s_2, s_3\}$ in $C_i^2$.

For all $1 \le i \le q$, let $Q_i$ (resp. $Q_{q+i}$) be a
segment of $n + 1$ symbols $y_i$ (resp. $y_{q+i}$). Moreover,
for all $1 \le i \le q$, let $W_i$ (resp. $V_i$) be a segment of
$20(\max\{q,n\})^2$ symbols $w_i$ (resp. $v_i$). Let us now
define $P_1$ and $P_2$.

For all 1 < i < q — 1, (1) add an arc in Pi between 

x{l,Xk,Cl) (resp. x(l,^,C'^)) and x(l, 2:fe, i^Vi) 
(resp. x(l; VI < /c < n (see Figure [T]d 

andlUb); (2) add an arc in P2 between x{ji^k,Cf) 
(resp. xU^'^^Cf)) and x((4 - j), a:^fc, Jf ) (resp. 
X((4 - j),x^,Pi)), VI < fc < 71 (see Figure fflc, Ela 
and[2]c); (3) add an arc in P2 between xi^^RiiCf) 
and xi'^,Ri+^,P^ ), VI < j < 3 (see Figure fflc, Oa 
and He). 

Clearly, this construction can be achieved in
polynomial time, and yields sequences $(S_1, P_1)$ and
$(S_2, P_2)$ that are both of type Stem. We now give an
intuitive description of the different elements of this
construction.

Each clause $c_i \in C_q$ is represented by a pair
$(C_i^1, C_i^2)$ of sequences. The sequence $C_i^2$ is composed
of three subsequences representing a selection
mechanism of one of the three literals of $c_i$. The pair
$(S_M^1, S_M^2)$ of sequences is a control mechanism that
will guarantee that a variable $x_k$ cannot be true and
false simultaneously. Finally, for each clause $c_i \in C_q$,
the pair $(P_i^1, P_i^2)$ of sequences is a propagation
mechanism whose aim is to propagate the selection of the
assignment (i.e. true or false) of any literal $x_k$ all over
$C_q$. Notice that all the previous intuitive notions will
be detailed and clarified afterwards.

In the rest of this article, we will refer to any such
construction as a snail-construction. In order to
complete the instance of the Lapcs(Stem, Stem)
problem, we define the parameter
$k' = 40q(\max\{q,n\})^2 + 6qn + 8q + n$,
which corresponds to the desired length
of the solution. In the following, let $(S_1, P_1)$ and
$(S_2, P_2)$ denote the arc-annotated sequences obtained
by a snail-construction. We will denote by $S_d$ the set of
symbols deleted in a solution of the Lapcs problem on
$(S_1, P_1)$ and $(S_2, P_2)$ (i.e. the symbols that do not
belong to the common subsequence).

We start the proof that the reduction from 3SAT 
to Lapcs(Stem, Stem) is correct by giving some 
properties about any optimal solution. 

Lemma 1. In any optimal solution of the Lapcs problem
on $(S_1, P_1)$ and $(S_2, P_2)$, at least one symbol incident
to any arc is deleted. Moreover, none of the symbols
of $V_i$ and $W_i$, for $1 \le i \le q$, is deleted.

Proof. By contradiction, let us suppose that there
exists at least one arc s.t. the two symbols incident
to it are not deleted in a solution of the
Lapcs problem on $(S_1, P_1)$ and $(S_2, P_2)$. Then, by
construction, at least one complete
sequence $V_j$ or $W_j$, for a given $1 \le j \le q$, has
been deleted. Since they have the same length,
we will consider w.l.o.g. afterwards that $V_j$ has
been deleted. Therefore, since $S_1$ is, by construction,
smaller than $S_2$, the length of this optimal solution
is at most

$|S_1| - |V_j| = \sum_{i=1}^{q} (|C_i^1| + |P_i^1| + |V_i| + |W_i|) + |S_M^1| - |V_j|$
$= \sum_{i=1}^{q} ((6n + 11) + (6n + 7) + 20(\max\{q,n\})^2 + 20(\max\{q,n\})^2) + 2n - 20(\max\{q,n\})^2$
$= q[12n + 18 + 40(\max\{q,n\})^2] + 2n - 20(\max\{q,n\})^2$.

Then, in order for this solution to be optimal, one should have
$q[12n + 18 + 40(\max\{q,n\})^2] + 2n - 20(\max\{q,n\})^2 \ge
40q(\max\{q,n\})^2 + 6qn + 8q + n$. This can be reduced
to $6qn + 10q - 20(\max\{q,n\})^2 + n \ge 0$. But one can
easily check that for any $n \ge 3$ (which is always the
case in 3SAT instances), this is not true; a
contradiction. □
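
The final inequality of this proof can be checked numerically. The Python sketch below computes the difference between the upper bound $|S_1| - |V_j|$ and the target length $k'$, and verifies that it is negative on a range of values. It assumes, as our reading of the construction, that $V_i$ and $W_i$ have length $20(\max\{q,n\})^2$; the function name `lemma1_gap` is hypothetical.

```python
def lemma1_gap(q, n):
    """Upper bound on the solution length when some V_j is fully
    deleted, minus the target length k'. A negative value means such
    a solution cannot reach k'."""
    m = max(q, n) ** 2  # assumed reading of the segment-length exponent
    upper = q * (12 * n + 18 + 40 * m) + 2 * n - 20 * m
    target = 40 * q * m + 6 * q * n + 8 * q + n
    return upper - target


# The gap simplifies to 6qn + 10q + n - 20 max(q, n)^2 and stays
# negative for every tested pair with n >= 3.
for q in range(1, 40):
    for n in range(3, 40):
        assert lemma1_gap(q, n) == 6*q*n + 10*q + n - 20 * max(q, n) ** 2
        assert lemma1_gap(q, n) < 0
```

With $M = \max\{q,n\}$, the gap is at most $6M^2 + 11M - 20M^2 = 11M - 14M^2$, which is negative as soon as $M \ge 1$, in agreement with the check above.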

Lemma 2. Any optimal solution of the Lapcs
problem on $(S_1, P_1)$ and $(S_2, P_2)$ is of length
$40q(\max\{q,n\})^2 + 6qn + 8q + n$.

Proof. By construction, in 5*1 there is (1) VI < 
i < n, 2q + 1 occurrences of Xi (resp. Xi); 
(2) VI < i < g, 4 occurrences of Qi (resp. 
Qq+i); (3) VI < i < g, 1 occurrence of each 

{R}, Rq+i, Ri, Rq+i, Rq+i^ Wi, Vi, Sl,S2,S3}; (4) VI < 

1 < q, 2 occurrences of i?f . 

Whereas, in S2, there is (1) VI < i < 
n, 6(7+1 occurrences of Xi (resp. xl); (2) 
VI < i < g, 2 occurrences of Qi (resp. 
Qq+i); (3) VI < i < g, 1 occurrence of each 

{R},Rl Ri 1 Rq+i 1 Rq+i ' Rq+i ' 

Wi,Vi,Si,S2,Sz]. 

Therefore, in any optimal solution there may be 
only (1) VI < i < n, 2q + 1 occurrences of Xi 
resp. Xi); (2) VI < i < g, 2 occurrences of Qi 
resp. Qq+i); (3) VI < « < g, 1 occurrence of each 
{Rj , R"^ , , Rlj^_i, R'^_^_i, R^g_^_^,W^,Vi, si, S2, S3} . 

More precisely, by Lemma ffl and since, by con- 
struction, there is an arc in P2 between x{^, Rj^Cf) 
and x{^,Rl+^,P^), Vj e {1,2,3}, in any opti- 
mal solution, VI < * < <Z, only half of the 
{R}, Ri,Ri, R\^i,Rq^i, R^+i) may be conserved. 

Moreover, any Xi (resp. 'xt) oi Si except in C^, is 
linked by an arc to another Xi (resp. Ixl), therefore 
by Lemma [U in any optimal solution, VI < i < g — 1, 
only half of the occurrences of Xi (resp. xt) may be 
conserved. 

Finally, in any optimal solution, only half of the 
occurrences of {xi,a?7} and one over {si, S2, S3} in 
and S\,j may be conserved. Indeed, by construction, 
if this is not the case in (resp. 5"^), it impHes that 

at least one complete sequence Qq (resp. Vi or Wi) 
is totally deleted - which is not optimal since it is of 
length n + 1 (resp. 2{){max{q, n}^)). 

On the whole, the maximal total length of any 
solution is thus equal to AQq{max{q, n}'^) + Qqn + 8q + 
n. Moreover, this solution is composed of (1) VI <i < 
n, 2q+l occurrences of either Xi or a^, (2) VI < z < q, 

2 occurrences of Qi and Qq+i, (3) VI < i < g, 1 
occurrence of each {Wi,Vi} and either si, S2 or S3 
and (4) VI < z < g, R^,R^,R^^ s.t. {ji,j2,j3} = 
{1,2,3}. □ 

Lemma 3. In any optimal solution of the Lapcs problem
on $(S_1, P_1)$ and $(S_2, P_2)$, if $\chi(1, x_k, S_M^1)$ (resp.
$\chi(1, \bar{x}_k, S_M^1)$) for a given $1 \le k \le n$ is deleted then,
$\forall 1 \le j \le q$, $\chi(1, x_k, C_j^1)$ (resp. $\chi(1, \bar{x}_k, C_j^1)$) is
deleted.

Proof. By construction, VI < fc < n only one of 
{xfc, affc} may be conserved between 5^ and Sf^ since 
x{l,Xk,SM) < x(1,^,S'm) whereas x(l,^, 'S'lf) < 
x{l,Xk, S^j). By Lemma [U at least one symbol in- 
cident to any arc is deleted. Therefore, VI < fc < n 
only one of {xk,'Xk} may be conserved between Cl 
and Cf. 

Let us suppose that for a given 1 < k < n, 
x{l,Xk, SIj) is deleted. According to the proof of 
LemmalU in any optimal solution, VI < A; < n exactly 

Figure 1: Considering Cq = (xi V X2 Vxa) A (3ri"V3?2 VX4) A (x2 Vaii Vaij). For readability, not all the arcs have
been drawn; consecutive arcs are represented by a single arc with lines for endpoints. Symbols over a grey
background may be deleted to obtain an optimal LAPCS. a) A schematic view of the overall arrangement of 
the components of the two a. a. sequences, b) Description of S\j, S]^, , and the corresponding arcs in 
Pi. c) Description of Cl, Cf, P/, Pf and the corresponding arcs in P2. d) Description of Cl, Cf, P2, P| and 
the corresponding arcs in Pi. 




Figure 2: Considering Cq = {xi V 2:2 V xz) A {xi V a;2 V X4) A (x2 V X3 V Xi). For readability, not all the arcs have
been drawn; consecutive arcs are represented by a single arc with lines for endpoints. Symbols over
a grey background may be deleted to obtain an optimal LAPCS. a) Description of C2, Cf, P2 ) ^2 ^"^^ t^^^ 
corresponding arcs in P2. c) Description of C|, P^, P3 and the corresponding arcs in Pi. d) Description 
of C| , C| , PI , P| and the corresponding arcs in P2 • 



one of {xk,Xk} has to be deleted. Then 
is deleted whereas is conserved. 

By construction, in , since according to the 
proof of Lemma [21 both occurrences of Qg+i and 

i?f ,i?f s.t. {ju.n,j3} = {1,2,3} have 

to be conserved, either (1) {Rl, Rf, R^^-^^}, (2) 
{Rl,Rl,Rl+i} or (3) {i??, ijf , iJ^+J are conserved. 

Let us first consider that {Rl, Rf, Rg_^_l} are con- 
served. Then one can check that the only solution 
is to conserve x(2,-Ri,C|) since otherwise at least 
half of the Xk^s would not be conserved. Conse- 
quently, the only solution is to conserve, VI < fc < n, 
the first (resp. last) occurrence of any Xk or 
in Ci (resp. Pi) - i.e. the occurrences appear- 

iiig before x(l,Qi,<7i) (resp. after x(2, Pi )). 

Since by construction, there is an arc between 
x(l,a;fc,Ci2) Jresp. x(l,^,C'?)) and^x(3, a;fe, ) 
(resp. x(3,^,Pi)), in order for x(l,^, P^) to be 
conserved, one has to conserved x(3,a^, Pi). Thus, 
by Lemma [11 x(l,^, Cf) has to be deleted and, ac- 
cording to the proof of Lemma [2l x(l7a;fe, Ci) has to 
be conserved. 

Let us now consider that {Rl, Rf, R^_^,i} are con- 
served. By a similar reasoning, one can check 
that the only solution is to conserve, VI < /c < 
n, the second occurrence of any x^ or x^T in Ci 
(resp. Pi^) - i.e. the occurrences appearing between 
x(l,Qi,Ci2) and x(2,Qi,Ci2) (resp. x(l, Pf) 
and x(2, Qg+i, Pi^))- Since by construction, there 
is an arc between x(2,a;fc,Ci) (resp. x(2,Sfc, Ci)) 
and x(2,a:^fc,Pi^) (resp. x(2,5fe, Pf)), in order to 
x(l,Xfe, Pi) to be conserved, one has to conserved 
x(2,Xfe, Pf). Thus, by LemmalU x(2,^, Ci) has to 
be deleted and, according to the proof of Lemma ^ 
X(2,a;fe,Ci) has to be conserved. 

Finally, let us consider that {Rl, Rf, R^^i} are 
conserved. Once again, by a similar reasoning, 
one can check that the only solution is to conserve 
X(l,Pi,Ci) since otherwise at least half of the x^s 
would not be conserved. Consequently, the only solu- 
tion is to conserve, VI < fc < n, the last (resp. first) 
occurrence of any Xk or Xk in Ci (resp. Pi) - i.e. 
the occurrences appearing after x(2,Qi,Ci) (resp. 
before x(l7 Qg+i, Pi ))■ Since by construction, there 
is an arc between x(3,a;fe,Ci) (resp. x(3,^, Ci)) 
and x{^,Xk,P?) (resp. x(l,^.Pf)): in order to 
X(l,Xfc, Pi) to be conserved, one has to conserved 
x(l,a^, Pi^). Thus, by LemmalU x(3,aJfe, Ci) has to 
be deleted and, according to the proof of Lemma [H 
x(3,a;fe,Ci) has to be conserved. 

Therefore, in the three cases, if for a given 1 < fc < 
n, x(l, Xk, Sl^) is conserved then so does x(l: Xk,Cl). 
It is easy to see that, by a similar reasoning, if for a 
given 1 < fc < n, x(Ii^i'S'm) is conserved then so 
does x(l,^,,C'^^). 

With a similar reasoning, by reccurence, since, 
VI < i < g, 1 < fc < n, there is an arc in Pi between 
x(l,a;fe,C/) (resp. x(l,^,C'i^)) and xi'^, Xk, Pl+i) 
(resp. x(l,^i Pi+i))j if x(l, Cj^) is conserved then 
x(l, a;fc, P/^i) is deleted. And therefore, with similar 
arguments, x(lj Xk, C^+i) is conserved. Once more, it 
is easy to see that this result still holds if x(l,^, Cf) 
is conserved. □ 

Theorem 2. Given an instance of the problem 3SAT
with $n$ variables and $q$ clauses, there exists a
satisfying truth assignment iff the Lapcs of $(S_1, P_1)$ and
$(S_2, P_2)$ is of length $k' = 40q(\max\{q,n\})^2 + 6qn +
8q + n$.

Proof. (^) An optimal solution for Cq — {xi V 2:2 V 
X3)A(aJTVx2 Va;4)A(a;2 VIesV^) - i.e. xi ^ x^ — true 
and X2= xa = false - is illustrated in Figures [H and 
[31 where any symbol over a grey background have to 
be deleted. Suppose we have a solution of 3SAT, that 
is an assignment of each variable of Vn satisfying Cq. 
Let us first list all the symbols to delete in ^i . 

For all 1 < fc < n, if Xk — false then 
delete, VI < j < g, {x{l,Xk,C]), x(l,5^,P/)} 
and xi^iXk, S\j); otherwise delete, VI < < q, 
{x(l,^,C'j), x{'^,Xk,Pj)} and x(1,^,<5'm). 

For each satisfying Ci with the biggest index j 
with 1 < i < q, 

if (1) J = 1 then delete {x(l, Pf , C}), x(l, Q^, C}), 
x(l,Pf,Ci), x(2,g.,Ci), x(l,S2,Ci), x(l,S3,Ci), 
x(i,P^+„i^i), x(l,Pj+.,P/), X(3,Q,+„P/), 
X(4,Q,+„P/)} (cf Figure ffla); 

if (2) J - 2 then delete {x(l, P?, C}), x(2, Q^, Cf), 
x(l,si,Ci), x(l,S3,Ci), x(3,Q.,Ci), x(2,Pf,Ci), 
x{2,Qq+,,Pl), x{l,Rl+^,Pl), X(1,PJ+„P/), 
x{i,Qq+^,Pl)] (cf Figure Ha); 

if (3) J = 3 then delete {x(l, si, C}), x(l, S2, C/), 

x{i,Q^,cl), x(2,Pf,ci), x{^Q^,cl), x(i,P.hci), 

X{l,Qq+i,Pl), x(.2,Q,+r,Pl), Xa,R'q+r^Pl), 

x{l,Rl+,,Pl)} (cf Figure [He); 

Let us now list all the symbols in S2 to be deleted. 

For all 1 < fc < n, if Xk — false then delete 
x{l,Xk,Sli); otherwise delete x{^,Xk,Slj). 

For each satisfying a with the biggest index j 
with 1 < i < q, 

if (1) j = 1 then delete VI < fc < n {x(l, Pf , Cf), 
x{l,S2,Cf), x{2,Xk,Cf), x{2,x^,Cf), x(l,S3,C2), 
x{3,Xk,Cf), x(3,a^,C2), x(l, a:^, P,^), x(l,^,P?), 

x{i,Rl+,,Pl), x{i,Rl+,,Pl), x^Xk^Pf), 

x(2, aJfc, P;'^)}. Moreover, if Xk — false with 
1 < fc < n then delete, {x(l, Xfe, Cf), x(3,xr,p2)}; 
otherwise delete 

{x(l,aJr,C2), x{i,Xk,Pf)} (cf Figure ma); 

if (2) j = 2 then delete VI < fc < n {x(l, Pf , C^), 
x(l,si,C2), x(l,^fe,C'2), x(l,5^,C2), x(l,S3,C2), 
x{i,Xk,Cf), x(3,^,C2), x{^,Xk,Pl), x(l,^,P?), 
x{l,B^+,,Pf), xil,Rl+^,Pf), x{3,Xk,P^), 
x{3,Xk,Pi)}- Moreover, if Xk = false with 
1 < fc < n then delete, {x(2, Xfe, Cf), xi2,x^,P?)}; 
otherwise delete 

{xi2,x^,Cf), xi2,Xk,Pi)} (cf Figure Ha); 

if (3) j = 3 then delete VI < fc < n {x(l, RlCf), 
xihsi,Cf), x{hxk,Cf), x{l,x^,Cf), x(l,S2,Cf), 
x{2,Xk,Cf), x{2,x^,Cf), x{2,Xk,Pf), x{2,x^,P^), 
x{i,Rl+,,P^), x{l,Rl+^,Pf), x{i,Xk,P?), 
x(3, aJfe, Pj'^)}. Moreover, if Xk — false with 
1 < fc < 71 then delete, {x(3, Xfe, Cf), x(l, x^, i^')}; 
otherwise delete 

{x(3,x^,Cf), xi:^,Xk,Pi)} (cf Figure [Ic); 

By construction, the natural order of the symbols 
of 5*1 and iS'2 allows the corresponding set of undeleted 
symbols to be conserved in a common arc-preserving 
common subsequence between (5*1, Pi) and (S'2,P2)- 
Let us now prove that the length of this last is fc'. 
One can easily check that this solution is composed 
of VI < fc < n, (1) 2q -I- 1 occurrences of either Xk 
or Xk, (2) VI < i < q, 2 occurrences of Qi and Qq+i, 
(3) VI < i < g, 1 occurrence of each and 



either si, S2 or S3 and (4) VI < i < g, Hl'^li^,!^^ 
s-t- {ji,j2,j3} = {1,2,3}. Thus, the length of the 
solution is AQq{max{q^ n}^) + &qn + 8g + n. 

(<^=) Suppose we have an optimal solution - i.e. a 
set of symbols Sd to delete - for Lapcs of {Si, Pi) 
and (>S'2,P2)- Let us define the truth assignment of 
14 s.t., VI < I < g, if x(l,Si,Ci) ^ Sd then is 
true. Let us prove that it is a solution of 3SAT. 

By construction, if — Xk (resp. Tk) then in C}, 
Sj appears between Xk and Xk whereas in C| it ap- 
pears after Uck (resp. before Xk)- Thus, if x(l, Sj: C}) 
is not deleted then (resp. Xk) in Cf is deleted if 

— Xk (resp. S^). Consequently, according to the 
proof of Lemma m if x(1j Sj.C}) is not deleted then 
Xk (resp. Xk) in all C},, with 1 < i' < g is deleted 
if Ll = Xfc (resp. xTk). Therefore, we can ensure 
that one cannot obtain Ll and L\, being true whereas 

= Lj, (that is a variable cannot be simultaneously 
true and false). By Lemma [21 we can ensure that for 
any 1 < i < q exactly one of {si, S2, S3} is conserved 
in Cl- Therefore, for any clause Ci at least one of its 
literal is set to true. This ensures that our solution is 
a solution of 3SAT. □ 

4 Future work 

From a computational biology point of view,
especially for comparing stems, one may, however, be
mostly interested in the case where $k$ (the length of the
common subsequence searched for) cannot be assumed to be
small compared to $n$. A first approach is provided in
(Alber et al. 2004), where it is proved that, given two
sequences of length at most $n$ with nested arc
structure, an arc-preserving common subsequence obtained
by deleting (together with the corresponding arcs)
$k_1$ letters from the first and $k_2$ letters from the second
sequence can be determined (if it exists) in
$O(3.31^{k_1+k_2} n)$ time. Improving the running time of this
parameterization in the case of stem arc structures appears to
be a promising line of research.

References 

Alber, J., Gramm, J., Guo, J. & Niedermeier, R. 
(2004), 'Computing the similarity of two sequences 
with nested arc annotations'. Theoretical Computer 
Science 312(2-3), 337-358. 

Blin, G., Denise, A., Dulucq, S., Herrbach, C. &
Touzet, H. (2008), 'Alignment of RNA structures',
IEEE/ACM Transactions on Computational Biology
and Bioinformatics. To appear.

Blin, G., Fertin, G., Herry, G. & Vialette, S. (2007),
Comparing RNA structures: towards an intermediate
model between the Edit and the Lapcs problems,
in M.-F. Sagot & M. E. Telles Walter,
eds, '1st Brazilian Symposium on Bioinformatics
(BSB'07)', Vol. 4643 of Lecture Notes in Bioinformatics,
Springer-Verlag, Angra dos Reis, Brazil,
pp. 101-112.

Blin, G., Fertin, G., Rusu, I. & Sinoquet, C. (2007), 
Extending the hardness of rna secondary structu, 
in B. Chen, M. Paterson & G. Zhang, eds, '1st 
intErnational Symposium on Combinatorics, Algo- 
rithms, Probabilistic and Experimental methodolo- 
gies (ESCAPE'07)', Vol. 4614 of LNCS, Springer- 
Verlag, Hangzhou, China, pp. 140-151. 



Evans, P. (1999), Algorithms and Complexity for An- 
notated Sequences Analysis, PhD thesis. University 
of Victoria. 

Garey, M. & Johnson, D. (1979), Computers and
Intractability: a guide to the theory of
NP-completeness, W.H. Freeman, San Francisco.

Gramm, J., Guo, J. & Niedermeier, R. (2006), 'Pat- 
tern matching for arc-annotated sequences', ACM 
Transactions on Algorithms 2(1), 44-65. To ap- 
pear. 

Guignon, V., Chauve, C. & Hamel, S. (2005), An edit 
distance between rna stem-loops, in M. P. Consens 
& G. Navarro, eds, '12th International Conference 
SPIRE', Vol. 3772 of LNCS, pp. 335-347. 

Jiang, T., Lin, G., Ma, B. & Zhang, K. (2002), 'A gen- 
eral edit distance between RNA structures', Jour- 
nal of Computational Biology 9(2), 371-388. 

Jiang, T., Lin, G., Ma, B. & Zhang, K. (2004),
'The longest common subsequence problem for
arc-annotated sequences', Journal of Discrete
Algorithms pp. 257-270.

Lin, G., Chen, Z.-Z., Jiang, T. & Wen, J. (2002),
'The longest common subsequence problem for
sequences with nested arc annotations', Journal of
Computer and System Sciences 65, 465-480.

Zhang, K. & Shasha, D. (1989), 'Simple fast
algorithms for the editing distance between trees
and related problems', SIAM Journal on Computing
18(6), 1245-1262.



